
  • Gene mutation split | Regex string

    Hello all,

    I have a column of co-occurring mutations for each observation and need to split it into multiple columns, one mutation per column. I managed to split off the first mutation and the remainder (second through last), but I am stuck on going further, i.e., stopping at the first instance of a ")".
    The string is too long for dataex to export, but a sample of the column 'Cooccurmutn' is shown below. The pattern I find useful is:
    1. gene names always start with an upper-case letter, and
    2. each gene name follows either a lower-case letter or a ")" that ends the previous entry.
    DNMT3A c.2645G>A, p.R882H (NM_022552.4) (VAF: 38%)IDH2 c.419G>A, p.R140Q (NM_002168.3) (VAF: 37%)JAK2 c.1849G>T, p.V617F (NM_004972.3) (VAF: 71%)
    CUX1 Loss - EquivocalWT1 Loss - EquivocalAPC Loss EquivocalBRCA1 Loss - EquivocalEZH2 Loss - EquivocalNF1 Loss
    SF3B1 c.2098A>G, p.K700E (NM_012433.4) (VAF: 12%)WT1 c.1156_1159dup, p.A387Vfs*4 (NM_024426.6) (VAF: 5%)ATR c.2634-1G>A, p.? (NM_001184.4) (VAF: 15%)STK11 Loss - Equivocal

    gen com = ustrregexs(0) if ustrregexm(Cooccurmutn, "(^[A-Z].*[a-z]$)") // first comutation
    replace com = ustrregexs(0) if ustrregexm(Cooccurmutn, "^[A-Z].*([a-z]|\))$") & missing(com) // first comutation
    replace com = ustrregexs(1) if ustrregexm(Cooccurmutn, "(^[A-Z]*.*[a-z]{4})([A-Z]*.*)") & missing(com) // first comutation
    replace com = ustrregexs(1) if ustrregexm(Cooccurmutn, "(^[A-Z]*.*([0-9]|%)\))([A-Z]*.*)") & missing(com) // first

    gen com2 =ustrregexra(Cooccurmutn, "(^[A-Z]*.*([0-9]|%)\))","") // second through end

    What I want for the first row/obs would be:
    comut1 comut2 comut3
    DNMT3A.* | IDH2.* |JAK2.*


    Alternatively, I have a file with a master list of all human genes (hugolist.dta). I was hoping to match and extract genes by looking up each row of my file against the column of genes in that file, but I don't know whether that is easier in R or in Stata. I am rusty with text functions in R, though.

  • #2
    I don't really understand what you seek. What does DNMT3A.* | IDH2.* |JAK2.* mean in the context of what you seek? Can you not just copy and paste EXACTLY the values you want in your comut variables to help us understand what you are looking for?

    Perhaps this will start you in a useful direction - the first step is to split each string on each right parenthesis.
    Code:
    input str300 text
    "DNMT3A c.2645G>A, p.R882H (NM_022552.4) (VAF: 38%)IDH2 c.419G>A, p.R140Q (NM_002168.3) (VAF: 37%)JAK2 c.1849G>T, p.V617F (NM_004972.3) (VAF: 71%)"
    "CUX1 Loss - EquivocalWT1 Loss - EquivocalAPC Loss EquivocalBRCA1 Loss - EquivocalEZH2 Loss - EquivocalNF1 Loss"
    "SF3B1 c.2098A>G, p.K700E (NM_012433.4) (VAF: 12%)WT1 c.1156_1159dup, p.A387Vfs*4 (NM_024426.6) (VAF: 5%)ATR c.2634-1G>A, p.? (NM_001184.4) (VAF: 15%)STK11 Loss - Equivocal"
    end
    generate seq = _n
    split text, generate(maybe) parse(")")
    local maybes `r(varlist)'
    foreach v of local maybes {
        replace `v' = ustrregexrf(`v'," .*","")
    }
    reshape long maybe, i(seq) j(which)
    drop if maybe==""
    bysort seq (which): replace which = _n
    reshape wide maybe, i(seq) j(which)
    list seq maybe*, noobs clean
    Code:
    . list seq maybe*, noobs clean
    
        seq   maybe1   maybe2   maybe3   maybe4  
          1   DNMT3A     IDH2     JAK2           
          2     CUX1                             
          3    SF3B1      WT1      ATR    STK11



    • #3
      Sorry for the confusion, William Lisowski. And thanks much for that code. What I really wanted at the end was (just first co-mutation for first case depicted below):
      Comut1   Comut1type                         Comut1vaf
      DNMT3A   c.2645G>A, p.R882H (NM_022552.4)   38

      Even if I could only split each observation into its per-gene chunks, I should be able to split the smaller pieces more easily. It was the second row, where a lower-case letter rather than a ")" precedes the next gene, that threw me off when trying to split the gene chunks per observation. The parse seems to skip the second entry, which does not contain any ")".
      Last edited by Girish Venkataraman; 17 Dec 2022, 11:25. Reason: Accidental post before completion
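
One possible way to get per-gene chunks, following the two-part pattern described in #1 (a gene name starts with an upper-case letter and directly follows either a lower-case letter or a ")"), is to mark each such boundary with a delimiter and then split on it. The sketch below is untested against the full data and rests on several assumptions: that "|" never occurs in the real strings, that the ICU lookbehind/lookahead syntax is accepted by ustrregexra() in your Stata release, and that every VAF is written as "(VAF: nn%)". The variable names marked, chunk, comut#, comut#type and comut#vaf are illustrative only.

```stata
clear
input str300 text
"DNMT3A c.2645G>A, p.R882H (NM_022552.4) (VAF: 38%)IDH2 c.419G>A, p.R140Q (NM_002168.3) (VAF: 37%)JAK2 c.1849G>T, p.V617F (NM_004972.3) (VAF: 71%)"
"CUX1 Loss - EquivocalWT1 Loss - EquivocalAPC Loss EquivocalBRCA1 Loss - EquivocalEZH2 Loss - EquivocalNF1 Loss"
"SF3B1 c.2098A>G, p.K700E (NM_012433.4) (VAF: 12%)WT1 c.1156_1159dup, p.A387Vfs*4 (NM_024426.6) (VAF: 5%)ATR c.2634-1G>A, p.? (NM_001184.4) (VAF: 15%)STK11 Loss - Equivocal"
end
generate seq = _n

* Insert "|" wherever an upper-case letter directly follows a lower-case
* letter or ")" (assumption: "|" does not appear in the data itself)
generate marked = ustrregexra(text, "(?<=[a-z\)])(?=[A-Z])", "|")

* One chunk variable per co-mutation
split marked, parse("|") generate(chunk)
local n = r(nvars)

forvalues k = 1/`n' {
    * gene symbol: everything before the first whitespace
    generate comut`k' = ustrregexs(1) if ustrregexm(chunk`k', "^(\S+)")
    * VAF as a number, missing when the chunk has no "(VAF: nn%)"
    generate comut`k'vaf = real(ustrregexs(1)) if ustrregexm(chunk`k', "VAF: *([0-9.]+)%")
    * whatever is left once the gene symbol and the VAF parenthesis are removed
    generate comut`k'type = ustrtrim(ustrregexra(chunk`k', "^\S+|\(VAF:[^\)]*\)", ""))
}

list seq comut1 comut1type comut1vaf, noobs clean
```

If the boundary pattern behaves as intended, the second row should also be split at each "Equivocal" followed by a gene name, since the lower-case "l" counts as a boundary just as ")" does. Gene-internal text such as "p.R882H" or "A387Vfs*4" is not split in the sample because the upper-case letters there follow a period or a digit, but that is worth checking on the full data before relying on it.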
